Skip to content

Add tool ordering replay artifact#153

Merged
ProfRandom92 merged 1 commit into
mainfrom
codex/add-tool-ordering-replay-artifact
May 20, 2026
Merged

Add tool ordering replay artifact#153
ProfRandom92 merged 1 commit into
mainfrom
codex/add-tool-ordering-replay-artifact

Conversation

@ProfRandom92

Copy link
Copy Markdown
Owner

Motivation

  • Add a deterministic artifact that validates whether explicit tool-ordering commitments survive replay reconstruction and to surface compact, deterministic evidence for tool-order drift.
  • This builds on the existing MCP trace replay, replay graph core, graph-diff, and capability-boundary artifact foundation without touching fixture payloads or runtime systems.

Description

  • New generator scripts/generate_tool_ordering_replay_artifact.py extracts only explicit structured tool-order and required-before relations, uses graph helpers normalize_edges, nodes_from_edges, compare_edges, and find_order_violations, and emits a stable JSON artifact at artifacts/tool_ordering_replay_results.json following the project schema and sorting rules.
  • Added committed artifact artifacts/tool_ordering_replay_results.json representing deterministic evidence per family/fixture and a global summary.
  • Added tests at tests/test_tool_ordering_replay_artifact.py that verify artifact existence, generator reproducibility, top-level schema, determinism/sanitization, manifest alignment, tool-order evidence behavior (including zero-data handling), and label discipline.
  • Changed files: scripts/generate_tool_ordering_replay_artifact.py, artifacts/tool_ordering_replay_results.json, tests/test_tool_ordering_replay_artifact.py.

Testing

  • Ran generator: python scripts/generate_tool_ordering_replay_artifact.py and verified it writes artifacts/tool_ordering_replay_results.json reproducibly.
  • Ran targeted tests: pytest tests/test_tool_ordering_replay_artifact.py -q (new test set passed), plus pytest targets used during validation: tests/test_capability_boundary_replay_artifact.py, tests/test_graph_diff_artifact.py, tests/test_replay_graph_core.py, tests/test_fixture_manifest.py, tests/test_failure_taxonomy.py, and the full suite; all tests passed (new tests and full suite green, total test run shown: 270 passed).
  • Per-project check: npm run check completed successfully in CI-local validation.
  • Scope note: deterministic tool-ordering replay artifact only; no fixture payload changes, no README changes, no workflow changes, no runtime/orchestration behavior, no new failure labels, and no LLM/embedding/vector/fuzzy/network behavior introduced.
  • Risks: extraction is intentionally conservative and only uses explicit structured tool-ordering fields (prose-expressed relations are ignored by design).

Codex Task

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a tool-ordering replay artifact along with a generation script and a comprehensive test suite. The script analyzes execution sequences from manifest fixtures to detect drift and ordering violations between original and reconstructed payloads. Feedback focuses on a critical logic error in sequence extraction that destroys temporal order, a fragility in edge generation from ordered lists, and the need for more robust testing using synthetic drift scenarios to verify the validation logic.

Comment on lines +120 to +129
def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]:
sequence_values: set[str] = set()
for payload in payloads:
edges: list[tuple[str, str]] = []
required: list[tuple[str, str]] = []
_walk(payload, edges, required)
for left, right in edges:
sequence_values.add(left)
sequence_values.add(right)
return tuple(sorted(sequence_values))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _extract_sequence function contains a critical logic error. It extracts nodes from edges and then sorts them alphabetically (sorted(sequence_values)). This completely destroys the temporal order of tool calls found in the replay. As a result, find_order_violations (line 164) will only check if constraints match alphabetical order, failing to detect actual execution order drift.

Additionally, this function is inefficient as it re-traverses the entire payload set. I suggest refactoring the extraction to collect the observed sequence in a single pass while preserving the order of appearance.

Suggested change
def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]:
sequence_values: set[str] = set()
for payload in payloads:
edges: list[tuple[str, str]] = []
required: list[tuple[str, str]] = []
_walk(payload, edges, required)
for left, right in edges:
sequence_values.add(left)
sequence_values.add(right)
return tuple(sorted(sequence_values))
def _extract_sequence(payloads: list[dict[str, Any]]) -> tuple[str, ...]:
sequence: list[str] = []
seen: set[str] = set()
def walk(obj):
if isinstance(obj, dict):
for key, value in obj.items():
if key in ORDERING_KEYS and isinstance(value, list):
for item in value:
node = item if isinstance(item, str) else _id_from_obj(item)
if isinstance(node, str) and node and node not in seen:
sequence.append(node)
seen.add(node)
walk(value)
elif isinstance(obj, list):
for item in obj:
walk(item)
for p in payloads:
walk(p)
return tuple(sequence)

Comment on lines +62 to +73
def _edges_from_ordered_list(value: list[object]) -> list[tuple[str, str]]:
ordered: list[str] = []
for item in value:
if isinstance(item, str) and item:
ordered.append(item)
continue
obj_id = _id_from_obj(item)
if obj_id is not None:
ordered.append(obj_id)
continue
return []
return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current implementation of _edges_from_ordered_list is fragile. If a single item in the list is not a string and lacks an identifier key, the function returns an empty list, causing the entire sequence of tool calls to be discarded. It should instead skip invalid items to preserve the remaining ordering information.

def _edges_from_ordered_list(value: list[object]) -> list[tuple[str, str]]:
    ordered: list[str] = []
    for item in value:
        node = item if isinstance(item, str) else _id_from_obj(item)
        if isinstance(node, str) and node:
            ordered.append(node)
    return [(ordered[i], ordered[i + 1]) for i in range(len(ordered) - 1)]

Comment on lines +85 to +106
def test_tool_ordering_evidence_behavior() -> None:
artifact = _load_json(ARTIFACT_PATH)
fixtures = [fixture for family in artifact["families"] for fixture in family["fixtures"]]

with_data = [
fixture
for fixture in fixtures
if fixture["tool_ordering"]["original_edge_count"] > 0
or fixture["tool_ordering"]["replay_edge_count"] > 0
]

if with_data:
assert artifact["global_summary"]["fixtures_with_tool_ordering_data"] > 0
else:
assert artifact["global_summary"]["fixtures_with_tool_ordering_data"] == 0

drift_count = sum(1 for fixture in fixtures if fixture["tool_ordering"]["drift_detected"])
assert artifact["global_summary"]["fixtures_with_tool_ordering_drift"] == drift_count

if drift_count > 0:
assert any(fixture["tool_ordering"]["drift_detected"] for fixture in fixtures)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The current tests for tool ordering evidence behavior only validate the 'zero-data' state of the current artifact. There are no tests that verify the generator's ability to actually detect drift or violations when provided with mismatched original and replay payloads. I recommend adding a test case with synthetic payloads that contain known ordering violations to ensure the validation logic is correct and remains so after refactoring.

@ProfRandom92 ProfRandom92 merged commit 2887f2c into main May 20, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant